(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(19) World Intellectual Property Organization
International Bureau

(10) International Publication Number: WO 01/67803 A1

(43) International Publication Date: 13 September 2001 (13.09.2001)

(51) International Patent Classification: H04Q 11/04, H04L 12/56

(21) International Application Number: PCT/GB01/00972

(22) International Filing Date: 6 March 2001 (06.03.2001)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data: 0006084.8, 10 March 2000 (10.03.2000), GB

(71) Applicant (for all designated States except US): BRITISH
TELECOMMUNICATIONS PUBLIC LIMITED COMPANY [GB/GB]; 81 Newgate Street,
London EC1A 7AJ (GB).

(72) Inventor; and
(75) Inventor/Applicant (for US only): HILL, Alan, Michael [GB/GB]; The Old
Police House, Park Road, Grundisburgh, Woodbridge, Suffolk IP13 6TP (GB).

(74) Agent: WILSON, Peter, David; BT Group Legal Services, Intellectual
Property Dept., Holborn Centre 8th Floor, 120 Holborn, London EC1N 2TE (GB).

(81) Designated States (national): CA, US.

(84) Designated States (regional): European patent (AT, BE, CH, CY, DE, DK,
ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE, TR).

Published:
- with international search report
- before the expiration of the time limit for amending the claims and to be
  republished in the event of receipt of amendments

For two-letter codes and other abbreviations, refer to the "Guidance Notes on
Codes and Abbreviations" appearing at the beginning of each regular issue of
the PCT Gazette.

(54) Title: PACKET SWITCHING

(57) Abstract: A method of allocating switch requests within a packet switch,
the method comprising the steps of: establishing switch request data at each
input port; processing the switch request data for each input port to generate
request data for each input port-output port pairing; comparing the number of
requests from each input port and to each output port with the maximum request
capacity of each input port and each output port; allocating all requests for
those input-output pairs where the total number of requests is less than or
equal to the maximum request capacity of each input port and each output port;
reducing the number of requests for those input-output pairs where the total
number of requests is greater than the maximum request capacity of each input
port and each output port such that the number of requests is less than or
equal to the maximum request capacity of each input port and each output port;
and allocating the remaining requests. Packets may be switched from an input
port to a specified output port in accordance with the allocations obtained
with the above method.

PACKET SWITCHING



This invention relates to packet switching (or cell switching), in particular
methods for allocating requests for switching from one of the inputs of a
packet switch to one of the outputs of the packet switch.

Input-buffered cell switches and packet routers are potentially the highest
possible bandwidth switches for any given fabric and memory technology. Such
devices require scheduling algorithms to resolve input and output contentions.
Two approaches to packet or cell scheduling exist (see, for example, A Hung et
al, "ATM input-buffered switches with the guaranteed-rate property," and A
Hung et al, Proc. IEEE ISCC '98, Athens, Jul 1998, pp 331-335). The first
approach applies at the connection level, where bandwidth guarantees are made.
A suitable algorithm must satisfy two conditions for this: firstly it must
ensure no overbooking for all of the input ports and the output ports, and
secondly the fabric arbitration problem must be solved by allocating all the
requests for time slots in the frame.

Fabric arbitration has to date been proposed by means of the Slepian-Duguid
approach and Paull's theorem for rearrangeably non-blocking, circuit-switched
Clos networks (see Chapter 3, J Y Hui, "Switching and traffic theory for
integrated broadband networks," Kluwer Academic Press, 1990). The above
algorithm can be summarised as firstly ensuring no overbooking and secondly
performing fabric arbitration by means of circuit-switching path-search
algorithms. It has been assumed that this algorithmic approach could only be
applied at the connection level, because of its large computational
complexity. For such a reason, proposals for scheduling of connectionless,
best-efforts packets or cells employ various matching algorithms, many related
to the "marriage problem" (see D Gale and L S Shapley, "College admissions and
the stability of marriage," American Mathematical Monthly, 69, 9-15 (1962) and
D Gusfield and R W Irving, "The Stable Marriage Problem: Structure and
Algorithms," MIT Press, 1989), in which the input-output connections for each
time slot or phase of the switch are handled independently, i.e. a frame of
time slots (and hence [illegible]) is not employed.

Although such algorithms for choosing a set of conflict-free connections
between inputs and outputs for each time slot, which are based on maximum size
or maximum weight bipartite graph matching algorithms, can achieve 100%
throughput (N McKeown et al, "Achieving 100% throughput in an input-queued
switch," Proc. IEEE Infocom '96, March 1996, vol.3, pp. 296-302), they are also
impractically slow, requiring running times of complexity O(N^3 logN) for every
time slot (R E Tarjan, "Data structures and network algorithms," Society for
Industrial and Applied Mathematics, Pennsylvania, Nov. 1983).

Iterative, heuristic, parallel algorithms such as iSLIP are known, which
reduce the computing complexity (i.e. time required to compute a solution) for
best-efforts packets or cells (N McKeown et al, "The Tiny Tera: a packet switch
core," IEEE Micro Jan/Feb 1997, pp 26-33). The iSLIP algorithm is guaranteed to
converge in at most N iterations, and simulations suggest on average in fewer
than log2N iterations. Since no guarantees are needed, this and similar
algorithms currently represent the preferred scheduling technique for
connectionless data at the cell level in input-buffered cell switches and
packet routers with large numbers of ports (e.g. N>10). The iSLIP algorithm is
applied to the Tiny Tera packet switch core, which employs Virtual Output
Queueing (VOQ), in which each input port has a separate FIFO (First In, First
Out) queue for each output, i.e. N^2 FIFOs for an NxN switch. If we assume that
each FIFO queue stores at least a number of cells equal to the average cell
latency L, and that each cell is a 53-byte (424-bit) ATM cell, then the total
input FIFO queue hardware count is O(424LN^2). With each element capable of
clocking out 424f bits per frame, this is a complexity product of
O((424)^2 fLN^2), which is a very large complexity. Fortunately, by employing a
single queue in the form of RAM in each port, acting as N virtual queues, the
hardware count can be reduced to O(424LN), and with parallel readout reducing
the number of steps per frame to just f, the overall complexity product can be
reduced to O(424fLN). Table 1 gives the hardware and "computing" steps for
these queues to provide f cells within a frame.
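
As a rough worked example of the queue figures above, the sketch below evaluates the hardware count, steps per frame and complexity product for both constructions. The function names and the sample values (N = 32, L = 10, f = 16) are illustrative assumptions only; the order-of-magnitude expressions are treated here as exact counts purely for the arithmetic.

```python
CELL_BITS = 424  # one 53-byte ATM cell


def fifo_queue_complexity(n_ports, latency_cells, cells_per_frame):
    """N^2 separate FIFOs: hardware 424*L*N^2, each element clocking 424*f bits per frame."""
    hardware = CELL_BITS * latency_cells * n_ports ** 2
    steps = CELL_BITS * cells_per_frame
    return hardware, steps, hardware * steps


def ram_queue_complexity(n_ports, latency_cells, cells_per_frame):
    """One RAM per port acting as N virtual queues, with parallel readout (f steps per frame)."""
    hardware = CELL_BITS * latency_cells * n_ports
    steps = cells_per_frame
    return hardware, steps, hardware * steps


if __name__ == "__main__":
    fifo = fifo_queue_complexity(32, 10, 16)
    ram = ram_queue_complexity(32, 10, 16)
    print("FIFO (hardware, steps, product):", fifo)
    print("RAM  (hardware, steps, product):", ram)
    print("product ratio:", fifo[2] // ram[2])
```

Under these assumptions the two complexity products differ by a factor of 424N, which is the saving the single-RAM construction is credited with.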

For unicast packets the iSLIP algorithm converges in at most N iterations,
where N is the number of input and output ports. On average the algorithm
converges in fewer than log2N iterations. The physical hardware implementation
employs N round-robin grant arbiters for the output ports and N identical
accept arbiters for the input ports. Each arbiter has N input ports and N
output ports, making N^2 links altogether. The total amount of hardware depends
on the precise construction of the round-robin arbiters. N McKeown et al, op
cit, employ a priority encoder to identify the nearest request from the port
closest to a pre-determined highest-priority port (see Figure 1). The priority
encoder reduces the number of links down to log2N parallel links, in order to
change the pointer if required. The log2N parallel links are then expanded back
up to N links again through a decoder. Details of the hardware complexity of
the arbiters are given in N McKeown, "Scheduling Algorithms for Input-Queued
Cell Switches," PhD Thesis, University of California, Berkeley, 1995. The
growth rate for the complete scheduler is O(N^4), each arbiter being O(N^3).
For a 32x32 cell switch (which is the size of the Tiny Tera switch), 421,403
2-input gates are required. This may be quite acceptable for such a small
switch, but the O(N^4) growth rate is extremely large.

In order to minimise the overall hardware and computing complexity, the best
structure for constructing the encoder is a binary tree, which requires O(2N)
elements (for large N) and only log2N steps per iteration, whilst the decoder
needs only O(N) elements. Pipelining cannot be employed to reduce to one step
per iteration, because the pointers cannot be up-dated until the single-bit
requests have passed through both sets of arbiters to the decision register.
The total hardware and computing complexities are given below in Table 1. The
hardware complexity now grows as O(N^2), rather than O(N^4), due to the binary
tree encoder and decoder.



                      Hardware      Computing Steps per   Hardware.Computing
                                    [illegible]           Complexity Product

Input RAM queues      O(424LN)      f                     O(424fLN)

Average Convergence   [illegible]   [illegible]           O(24fN^2 log2N(1 + log2N))

Guaranteed            [illegible]   O(4fN(1 + log2N))     O(24fN^3(1 + log2N))


Table 1. Hardware and computing complexities of the iSLIP algorithm for
scheduling f packets per port in a frame of f time slots.

The overall hardware.computing complexity product O(24fN^3(1 + log2N)) of the
iSLIP algorithm for scheduling f packets per port would be no less than that of
the maximum size and weight matching algorithms of N McKeown et al, "Achieving
100% throughput in an input-queued switch," Proc. IEEE Infocom '96, March 1996,
vol.3, pp. 296-302, if convergence must be guaranteed. There is a reduction to
O(24fN^2 log2N(1 + log2N)) for the average number of computing steps. The major
benefit of the iSLIP algorithm is its parallel nature, which allows the number
of computing steps to be traded against hardware complexity, thus reducing
computing times by a factor N^2 at the expense of increasing the hardware by
the same factor. It is interesting to note that hardware quantities for the
input RAM queues far exceed those needed for the scheduling electronics.

According to a first aspect of the invention there is provided a method of
allocating switch requests within a packet switch, the method comprising the
steps of:

(a) establishing switch request data at each input port;

(b) processing the switch request data for each input port to generate
request data for each input port-output port pairing;

(c) comparing the number of requests from each input port and to each
output port with the maximum request capacity of each input port and each
output port;

(d) allocating all requests for those input-output pairs where the total
number of requests is less than or equal to the maximum request capacity of
each input port and each output port;

(e) reducing the number of requests for those input-output pairs where the
total number of requests is greater than the maximum request capacity of each
input port and each output port such that the number of requests is less than
or equal to the maximum request capacity of each input port and each output
port; and

(f) allocating the remaining requests.

According to a second aspect of the invention there is provided a method of
allocating switch requests within a packet switch, the method comprising the
steps of:

(a) establishing switch request data at each input port;

(b) processing the switch request data for each input port to generate
[illegible];

[illegible] allocating further switch requests by the iterative application of
[illegible] maximum request capacity of each output port [illegible].

The present invention additionally provides a method of allocating switch
requests within a packet switch, the method comprising the steps of:

(a) establishing switch request data at each input port;

(b) processing the switch request data for each input port to generate
request data for each input port-output port pairing;

(c) identifying a first switch request from each of the input port-output
port pairing request data;

(d) identifying further switch requests by the iterative application of step
(c) until all of the switch request data has been identified;

(e) subject to the maximum request capacity of each input port and each
output port, allocating all of the identified switch requests; and

(f) reserving unallocated switch requests for use in the [illegible] request
allocation.

The invention will now be described with reference to the following figures,
in which:

Figure 1 is a schematic depiction of a known arrangement for allocating
switch requests;

Figure 2 is a schematic depiction of an apparatus for counting switch
requests according to the present invention;

Figure 3 is a schematic depiction of a second apparatus for counting switch
requests according to the present invention;

Figure 4 is a schematic depiction of an apparatus for counting the switch
requests for each output port of a packet switch according to the present
invention;

Figure 5 is a schematic depiction of an apparatus for counting the switch
requests for each output port of a packet switch according to an alternative
embodiment of the present invention;

Figure 6 is a graph comparing the performance of the iSLIP algorithm with
that of the present invention;

Figure 7 is a graph showing the performance ratio of the iSLIP algorithm
to that of the present invention;

Figure 8 is a second graph comparing the performance of the iSLIP
algorithm with that of the present invention; and

Figure 9 is a second graph showing the performance ratio of the iSLIP
algorithm to that of the present invention.

As the scheduling of best-effort, connectionless cells within an input-
buffered switch, router or network is being considered, each of the input ports
could be assumed to have N FIFO queues, each of which is destined for a
different output port (i.e. virtual output queueing - VOQ). Although the flows
are best-effort, we wish to schedule them on a frame-by-frame basis. However,
there is no pre-reservation of time slots within this frame, thus a number f of
cells are queued at each input port, in f time slots, and are being scheduled
according to their output port destinations in such a way as to avoid
contention. A particular cell or packet should be able to be transmitted across
the switch fabric during any one of f time slots. Before performing fabric
arbitration to ensure that there is no output port contention in each time
slot, we must first make sure that there is no overbooking of the input and
output ports within the frame.
If the total number of cells Nf (where N is the number of input ports and f
is the number of time slots in a frame) to be switched across a cell or packet
switch are to be computed together on a frame-by-frame basis, by means of a
path-searching algorithm for a 3-stage circuit switch, then every cell could be
represented as a port on the circuit switch. The number of computing steps
needed to ensure no overbooking then depends on the amount of hardware that is
acceptable. If O(fN log2(fN)) components are used, then O(fN) computing steps
are needed. The number of computing steps can be reduced to O((log2(fN))^2) if
more hardware is acceptable, i.e. O(fN (log2(fN))^2), using a Batcher sorting
network, but this hardware quantity may be too large to be acceptable. However,
to represent every cell as a port on a circuit switch is an over-restrictive
constraint in a cell switch. In fact, in a cell switch, it is only necessary to
ensure that the number of cells destined for each of the N output ports does
not exceed the number of cells or time-slots in the frame, as there is no
requirement to exit the output port in any specific time-slot.

The present invention concerns a method for ensuring no overbooking of the
input and/or output ports. An NxN request matrix R is defined, whose elements
r_i,j represent the number of cells in input port i destined for output port j.
The two conditions that ensure no overbooking are simply:

\[
\sum_{j=1}^{N} r_{i,j} \le f \ \text{ for all } i
\qquad \text{and} \qquad
\sum_{i=1}^{N} r_{i,j} \le f \ \text{ for all } j .
\]

In practice, cells from more than f time slots in each input port could be
considered in this procedure, if cells destined for overbooked ports have to be
discarded. Discarded cells could either be lost completely, or continue to be
queued for later attempts.
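
The two no-overbooking conditions amount to checking every row sum and every column sum of R against f. A minimal sketch follows, assuming R is held as a list of lists; the function name `overbooked_ports` is hypothetical and not taken from the text.

```python
def overbooked_ports(requests, f):
    """Return (input ports, output ports) whose total requests exceed f.

    requests[i][j] is the number of cells at input port i destined for
    output port j; f is the number of time slots in the frame.
    """
    row_sums = [sum(row) for row in requests]          # requests from each input
    col_sums = [sum(col) for col in zip(*requests)]    # requests to each output
    bad_inputs = [i for i, total in enumerate(row_sums) if total > f]
    bad_outputs = [j for j, total in enumerate(col_sums) if total > f]
    return bad_inputs, bad_outputs


if __name__ == "__main__":
    R = [[2, 1, 0],
         [3, 0, 1],
         [0, 0, 4]]
    print(overbooked_ports(R, f=4))
```

Both returned lists being empty corresponds to the case in which all requests can be allocated without reduction.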

In order to establish the number of request matrix elements, counters are
established, one for each queue. Figure 2 shows a schematic depiction of a
possible input port arrangement for counting the request matrix elements,
r_i,j. Each of the N input ports 10 to a switch fabric 20 has N FIFO queues 11,
N counters 12 and N switches 13 in order to direct cell requests to the
appropriate counter 12. Assuming just f cells in each port are counted within
the request matrix, each counter then requires log2f counting stages, requiring
O(N^2 log2f) counter elements altogether. If it is assumed that the individual
cell destination requests are input to these counters as single bits, there
will be a maximum of O(f) computing steps required of any counter, giving an
overall hardware.computing complexity product for the counters of
O(fN^2 log2f). Figure 2 shows that we also require O(N) switches in each input
port to direct the cell requests to the correct counter, i.e. O(N^2) in total.
The speed of these switches must be sufficient that, within one frame of f
slots, f log2N bits can be routed. The overall complexity product for the
switches is therefore fN^2 log2N.

The method of queueing cell requests can be refined further. In Figure 2,
each port has N FIFO queues, each capable of buffering up to f cells. This
requires O(fN^2 log2N) buffer elements, each capable of toggling f log2N times
per frame, i.e. a complexity product of O(f^2 N^2 (log2N)^2). Fortunately, by
employing a single queue in the form of RAM in each port, acting as N virtual
queues, the hardware count can be reduced to O(fN log2N), with the same number
of steps per frame, requiring a complexity product of O(f^2 N (log2N)^2)
overall. Since a particular cell stored in RAM will be allocated to any of the
f time slots within the frame, re-ordering may now also be required in an
output queue if it is desired to preserve the cell order between input and
output ports. This is now in essence the same as traditional time-slot
interchanging in time-shared circuit switches. However, it would be possible to
preserve the cell order of a virtual input queue by allocating time slots in
time order to the cells destined for the same output port. Even with efficient
buffering in a single queue within each input port, cell buffering is the most
complex function in a switch, requiring the largest quantity of the fastest
electronics. The cells here are just the request cells, containing only output
port destination addresses (and possibly input port addresses as well as other
parameter values). Much greater buffering complexity is needed for the actual
cells or packets carrying all the header information and payload.
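
The counting of request matrix elements described above can be sketched in software as follows. This is only an illustration under stated assumptions: each input port's request cells are modelled as a list of output-port indices, only the first f cells per port are counted, and the name `count_requests` is hypothetical.

```python
def count_requests(port_queues, n_ports, f):
    """Build the N x N request matrix from per-port lists of destination ports.

    port_queues[i] lists, in arrival order, the output port requested by each
    cell waiting at input port i. Only the first f cells per port are counted.
    """
    matrix = [[0] * n_ports for _ in range(n_ports)]
    for i, queue in enumerate(port_queues):
        for dest in queue[:f]:
            matrix[i][dest] += 1   # one counter per (input, output) pair
    return matrix


if __name__ == "__main__":
    queues = [[1, 1, 2], [0, 2, 2, 2], [0]]
    for row in count_requests(queues, n_ports=3, f=3):
        print(row)
```

Each inner increment plays the role of one single-bit request arriving at a counter, so a port with f queued cells costs f counting steps, in line with the O(f) figure quoted for the counters.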
Figure 3 shows an improved arrangement for counting the request matrix
elements. A serial input stream of cell requests is converted to a parallel
word, which is then transmitted over a parallel bus 31. Each line of the
parallel bus is connected via gates 32 to sections of the RAM 33, each RAM
section 34 holding an individual cell request of log2N bits. Each RAM section
can both read and write cell requests from and to the parallel bus. As the cell
requests are written into RAM sections 34, they are also decoded into
single-bit requests by the decoder 35 and transmitted to the array of counters
36. Each input port requires an array 36 having N counters, so the requirement
for an NxN switch is N^2. The overall complexity product, which was previously
dominated by the RAM queues, is now reduced by a factor log2N, which for large
N could be an order of magnitude. This reduction is in terms of RAM access
speeds, rather than quantity of buffer memory.

Once all of the matrix elements have been counted, the next step of the
method of the present invention is to add the request matrix elements r_i,j to
obtain the sum of the requests for each output port of the switch. If more than
f requests are taken into account, then the sum of the requests for each input
port must also be calculated. Figure 4 shows an array of counters 41, each
containing the number of requests for switching from a given input port to a
given output port, e.g. counter r_1,2 holds requests for switching from the
first input port to the second output port and counter r_1,N holds requests for
switching from the first input port to the Nth output port. The outputs from
the counters are fed into an array of N adders 42, with each adder
corresponding to one of the output ports of the switch fabric. Thus the outputs
of the counters which hold requests for switching to the first output port all
connect to the input of the adder 42 which is associated with the first output
port of the switch fabric. The counts may be represented as log2f-wide words,
each of which must be switched successively to the adder circuitry 43 by an
associated switch array 44. For conventional adder constructions the software
and hardware complexities are no greater than for counting the individual
request matrix elements.

The third step of the method of the present invention is to compare the
summations for each output port with f, which is the maximum number of requests
that can be sent from each input port and to each output port in each frame.
If the sum of a row or column of the matrix exceeds f, the number of requests
must be reduced [illegible] number of allowed requests to a number proportional
to the actual number of requests, i.e.

[equation illegible in source]

This step is efficient when there is a heavy concentration of requests on one
or a few input or output ports. However, if the traffic is distributed
uniformly from inputs to outputs, such that each element r_i,j is small
[illegible], then the method is less efficient. In an alternative embodiment of
the present invention, the requests are allocated in the following manner.
Allocation of requests must be performed fairly, so that every request has a
chance of being granted within the frame, while satisfying the "no overbooking"
condition, to keep the average delay seen by individual requests low.
Furthermore, every virtual output queue must have a chance of being granted a
request within the frame, to prevent starvation. To fulfil these requirements,
each r_i,j from all input ports must be granted at least one request, if they
ask for one or more requests, as one port cannot be granted a large number of
requests while other ports are granted none.
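
To illustrate why a purely proportional reduction can starve small requesters, the sketch below applies one plausible reading of the proportional step of the third step. The formula itself is not legible in this copy, so the rule used here, allowed = floor(f * r_i,j / column sum), is an assumption, as is the name `proportional_reduce`; only overbooked output ports (columns) are treated.

```python
def proportional_reduce(requests, f):
    """Scale down each overbooked column so that its total does not exceed f.

    Assumed rule (not taken from the source): each entry of an overbooked
    column is replaced by floor(f * r / column_total).
    """
    n = len(requests)
    allowed = [row[:] for row in requests]
    for j in range(n):
        total = sum(requests[i][j] for i in range(n))
        if total > f:
            for i in range(n):
                allowed[i][j] = (f * requests[i][j]) // total
    return allowed


if __name__ == "__main__":
    R = [[9, 0],
         [1, 2]]
    print(proportional_reduce(R, f=4))
```

With this rounding, the single request from the second input port to the first output is reduced to zero while the heavy requester keeps three, which is the kind of unfairness the alternative embodiment is designed to avoid.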
The tasks of summing requests and then reducing the number of requests as
necessary are now replaced by a single mechanism which iteratively counts up
the requests by granting one at a time to all r_i,j counters, until the sum of
the requests destined for any output port equals the number f of slots in the
frame. At that point no more requests can be granted to that output. On the
first iteration, all request matrix counters r_i,j with one or more requests
for a given output port j are granted one of these requests, so that up to N
requests may be granted in the first iteration (assuming here that f > N). Each
of the non-zero request matrix counters r_i,j is now decremented by one (i.e. a
1 is subtracted from each log2f-wide word) ready for the next iteration.
Meanwhile, the successful requests granted to each output port, which are
indicated as single bits in parallel from each r_i,j counter, are summed, for
example by an adder with parallel input ports, converting the individual
granted request bits into a word of length log2(2N) bits.
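
The iterative mechanism above can be sketched as a simple loop. This is an illustrative software model, not the hardware of the figures: it handles only the output-port limit (assuming at most f requests were counted per input port), it stops counting for an output on the iteration that would take its total above f, and the name `iterative_grant` is hypothetical.

```python
def iterative_grant(requests, f):
    """Grant one request per non-zero counter per iteration, per output port.

    Returns (granted, totals): granted[i][j] is the number of requests granted
    from input i to output j, totals[j] the number granted to output j.
    """
    n = len(requests)
    remaining = [row[:] for row in requests]
    granted = [[0] * n for _ in range(n)]
    totals = [0] * n
    closed = [False] * n          # outputs for which counting has stopped
    while True:
        progress = False
        for j in range(n):
            if closed[j]:
                continue
            askers = [i for i in range(n) if remaining[i][j] > 0]
            if not askers:
                continue
            if totals[j] + len(askers) > f:
                closed[j] = True  # this iteration would overbook output j
                continue
            for i in askers:      # single-bit grants summed for output j
                remaining[i][j] -= 1
                granted[i][j] += 1
            totals[j] += len(askers)
            progress = True
        if not progress:
            return granted, totals


if __name__ == "__main__":
    print(iterative_grant([[3, 0], [2, 1]], f=4))
```

Any slots left unfilled when an output is closed (f minus its total) remain available for allocation as additional requests.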

Figure 5 shows a circuit which can be used to implement the alternative
embodiment of the invention. The circuit includes an array of request matrix
counters 41 (as shown in Figure 4 and described above) and N adding elements
51. The adding elements 51 each comprise a parallel bit adder 52 which receives
in parallel the single bit requests from each associated request matrix counter
41 and creates a log2(2N)-wide word. The output of each parallel bit adder 52
is connected to a respective log2(2f)-wide adder 53.

There are N such parallel adders 52, each of which requires 4N binary
adders. The number of steps required in each iteration to obtain the sum of the
requests is log2N.log2(2N)/2. Those r_i,j counters that have a second request
for a given output port have a second single bit summed through the parallel
adders in a second iteration of the process described above. Altogether there
could be as many as f iterations, if only one input port has requests destined
for any one of the output ports. At the other extreme, for a uniform traffic
distribution where a given output receives cells from different inputs, there
could be just one iteration required. Because there is no buffering within the
parallel adders, pipelining can be employed to sum each of the iterations. The
maximum number of steps required for a maximum of f iterations is
(f log2N.log2(2N)/2). The complexities are summarised in Table 4.
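
As a quick numerical check of the step counts quoted above, the following sketch evaluates log2N.log2(2N)/2 per iteration and multiplies by the number of iterations. The function name and the sample values (N = 32, f = 16) are illustrative assumptions.

```python
import math


def summation_steps(n_ports, iterations):
    """Steps to sum the granted request bits: log2(N) * log2(2N) / 2 per iteration."""
    per_iteration = math.log2(n_ports) * math.log2(2 * n_ports) / 2
    return per_iteration * iterations


if __name__ == "__main__":
    print(summation_steps(32, 1))    # single iteration (uniform traffic)
    print(summation_steps(32, 16))   # worst case of f = 16 iterations
```

For a 32-port switch this gives 15 steps for one iteration and 240 steps in the worst case of f = 16 iterations.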

At the output of each parallel adder 52 there is a temporal succession of
up to f log2(2N)-wide words. The successive words must also be added
sequentially by the adder 53 to obtain the overall total number of requests
granted to an output port. Since this number cannot exceed f (i.e. the number
of time slots in the frame) this can be done using a log2(2f)-wide adder. The
adder construction in Figure 5 employs O((log2(2f))^2) half adders.
Alternatively, a conventional adder construction employing log2(2f) full adders
could be used.

The succession of f log2(2N)-wide words can be stored in a buffer of size
f.log2(2N), if necessary. These can then be clocked out sequentially at a
suitable rate via switches for the following log2(2f)-wide adder. The
log2(2f)-wide adder 53 calculates the overall total number of requests granted
to all input ports destined for a particular output port (and this total must
not exceed f). On the iteration that drives the overall total above f, the
counting must be stopped for that output port. Each request matrix counter
r_i,j whose cell requests are destined for that output port must be advised and
keep a record of the number of iterations it took part in. It will only be
allowed a number of requests equal to the number of iterations for which the
overall count of requests for a given output port is less than f. If the
overall total on the previous iteration is less than f, then up to N additional
requests may be allocated up to the total of f (when f > N). This can simply be
done by examining the status of each request matrix counter r_i,j in turn in a
maximum of N steps. When f is less than N, then up to f additional requests may
be allocated. This may also take a maximum of N steps, by examining the status
of each request matrix counter in turn, depending on the locations of the
requests distributed around the N counters. Each of the request matrix counters
now knows how many requests have been allocated to it. To maintain fairness
between the input ports destined for a particular output port, a pointer may be
used, so that additional requests can be allocated preferentially to different
input ports in different frames. (Of course, a specific pattern of requests
could be such that the same input ports are in fact allocated additional
requests in different frames.) There are many ways in which a pointer could be
used to allocate additional requests, including existing round-robin
techniques. The simplest way would be to cycle the pointer continuously around
the input ports, allocating one request at a time to any requesting port,
stopping when the required number of additional requests up to N has been
granted. The next frame that subsequently needs to allocate additional requests
then begins from that pointer position.
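
The "simplest way" of using the pointer described above might look like the sketch below for a single output port. It is an assumption-laden illustration: `remaining` holds the outstanding requests of each input counter for that output, `spare` is the number of slots still free (f minus the total already granted), and the function name is hypothetical.

```python
def allocate_additional(remaining, spare, pointer):
    """Cycle once around the input ports from `pointer`, granting one extra
    request to each requesting port until `spare` slots are used up.

    Returns (extra, next_pointer), where extra[i] is 0 or 1 and next_pointer
    is where the next frame needing additional requests should start.
    """
    n = len(remaining)
    extra = [0] * n
    pos = pointer
    for _ in range(n):                 # at most N steps, one per counter
        if spare == 0:
            break
        if remaining[pos] > 0:
            extra[pos] = 1
            spare -= 1
        pos = (pos + 1) % n
    return extra, pos


if __name__ == "__main__":
    pointer = 0
    for frame in range(3):
        extra, pointer = allocate_additional([2, 2, 2], spare=1, pointer=pointer)
        print(frame, extra, pointer)
```

Run over successive frames with the same request pattern, the single spare slot goes to a different input port each time, which is the fairness property the pointer is meant to provide.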

It would be possible to reduce the N steps required to allocate the
additional requests to O(log2N) steps, by using an (N,N) concentrator to pack
requests next to each other, so that only the required number up to N of those
concentrated requests can be gated or switched through to the parallel bit
adder. The concentrator construction should preferably be such that the
relative positions of the requests are preserved at the output of the
concentrator, like that in T Szymanski, "Design principles for practical
self-routing non-blocking switching networks with O(NlogN) bit-complexity,"
IEEE Trans. on Computers, vol.46, no. 10, 1057-1069 (1997). The gating or
switching of the concentrator outputs could be achieved by means of a modified
log2N-stage decoder, which not only produces, for example, a logic "1" on the
appropriate numbered decoder output port (controlling the concentrator output
port), to represent the last of up to N ports to be allowed through to the
adder, but also propagates through its log2N stages a logic "1" to all decoder
outputs (and controlled concentrator outputs) that lie above the last of up to
N ports. In this way all decoder output ports above and including the decoded
one provide enable bits ("1"s) to the concentrator output ports that they
control, and all the decoder ports below the decoded one provide disable bits
("0"s) to the concentrator output ports that they control. Because the pointer
could be in any of the N counter positions, and we wish preferably to allow
through requests starting from the pointer position, the process of allocating
additional requests could be split into two steps.

In the first, only the requests including and below the pointer position are
sent to the concentrator for gating or switching through to the adder.
Disabling of the requests above the pointer position can be performed by a
similarly modified decoder. In the second part, only those requests lying above
the pointer position are allowed through to the concentrator. The overall
hardware complexity required for the concentrator and decoder using this
construction is O(N^2 log2N), and the number of computing steps using
Szymanski's concentrator construction is O(log2N).
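
A behavioural sketch of the concentrator-plus-decoder selection may help. It models only the logical effect, not the log2N-stage hardware: the concentrator packs active requests together in order, the modified decoder is modelled as a thermometer code of enable bits, and the two passes handle requests at or after the pointer and then those before it. Mapping "including and below the pointer" to indices at or after the pointer is an assumption, as are all the function names.

```python
def concentrate(request_bits):
    """Pack the indices of active requests together, preserving relative order."""
    return [i for i, bit in enumerate(request_bits) if bit]


def thermometer_enable(n_outputs, count):
    """Enable bits: 1 for the first `count` concentrator outputs, 0 for the rest."""
    return [1 if count > k else 0 for k in range(n_outputs)]


def select_additional(request_bits, pointer, spare):
    """Choose up to `spare` requesting ports, starting from the pointer position."""
    n = len(request_bits)
    # Pass 1: requests at or after the pointer; pass 2: requests before it.
    first = [b if i >= pointer else 0 for i, b in enumerate(request_bits)]
    second = [b if pointer > i else 0 for i, b in enumerate(request_bits)]
    chosen = []
    for masked in (first, second):
        packed = concentrate(masked)
        enable = thermometer_enable(n, spare - len(chosen))
        chosen += [idx for k, idx in enumerate(packed) if enable[k]]
    return chosen


if __name__ == "__main__":
    print(select_additional([1, 0, 1, 1, 0, 1], pointer=3, spare=3))
```

The result matches what cycling the pointer port by port would select, but each pass is a fixed gate-through of the packed requests rather than a scan of up to N counters.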

In a further alternative embodiment of the present invention, the counting
and summing of requests could be performed as described above, except
that the counting and summation may continue beyond the iteration that drives
the overall total of requests above f for a particular output port, so that the
complete total of requests can be counted. The log2(2f)-wide adder would now
need to be a log2(2fN)-wide adder, with a hardware count of
N log2(2Nf) log2(4Nf)/2; computing steps O(f log2(2Nf)); and overall complexity
product O(N f log2^2(2Nf) log2(4Nf)/2). Furthermore, it could also be possible for the
counters to hold cell request counts from previous frames, and to add these to
the counts of each new frame. This would allow running totals, or perhaps
weighted averages, to be used covering more time slots than a single frame, if
the cell queue lengths employed are longer than a frame length. The number of cell
acceptances between input and output ports could then be calculated on perhaps
a fairer basis related to longer-term flows between input and output ports.
Indeed, whether the actual queue lengths employed are longer than the
frame length or not, i.e. even if unsuccessful cells are discarded within each
frame, it may be advantageous for the number of cell acceptances within each
frame to be related to such a longer-term measure of traffic flow requests
between ports, rather than just the cell requests within the frame itself. Of course
the number of cell acceptances could also be related to a combination of longer-
term flow and "within frame" requests.
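
One way to combine longer-term flow with "within frame" requests is sketched below. It assumes an exponentially weighted average with a single weighting parameter; the specification mentions running totals or weighted averages but does not prescribe a particular formula, so the function name update_running_counts and the parameter weight are hypothetical.

```python
def update_running_counts(running, frame_counts, weight=0.5):
    """Blend the N x N request counts of a new frame into a running average.

    running      -- previous running averages, one row per input port
    frame_counts -- request counts counted in the current frame
    weight       -- assumed share given to the current frame (0..1)
    """
    return [
        [weight * new + (1.0 - weight) * old for old, new in zip(old_row, new_row)]
        for old_row, new_row in zip(running, frame_counts)
    ]
```

Setting weight to 1 reduces this to using the "within frame" counts alone, while smaller values give more influence to traffic seen in earlier frames.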

Once the number of requests r_i,j allocated between each input and output
port within a frame is known, the individual cell requests buffered in the RAM
queues must be identified. This can be achieved by re-running the individual cell
requests out of the RAM queues, through the decoder switches and re-counting
them in the request matrix counters. This time, each counter is set at its allocated
total and can be decremented by each cell request bit that it receives. While the
counter is still above zero, a single bit (e.g. a "1") is sent back to the RAM queue
to the cell position corresponding to the current request bit, signifying cell
acceptance. When a cell request bit decrements the counter to zero, and for all
subsequent request bits, a single bit (e.g. a "0") is sent back to the appropriate
RAM queue cell position to signify non-acceptance of that particular cell request.
The status (accepted or rejected) of all cell requests stored in RAM queues is now
established. The hardware and computing complexities are the same as for the
first count of matrix elements r_i,j, except that an additional N(N + f) switches are
needed. The additional number of steps required through these is f. All accepted
cell requests are now ready to have their time slots calculated for transmission
across the switch fabric (fabric arbitration).
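
The re-counting pass can be sketched for a single input port as follows. The sketch assumes that the RAM queue is represented as a list of destination output ports in queue order, and that exactly the first "allocated" requests for each output port are accepted; the function name mark_acceptances and this data layout are hypothetical.

```python
def mark_acceptances(queue_destinations, allocated):
    """Replay one input port's queued cell requests against preset counters.

    queue_destinations -- destination output port of each queued cell, in order
    allocated          -- allocated request total for each output port
    Returns one status bit per cell: 1 = accepted, 0 = not accepted.
    """
    counters = list(allocated)  # each counter starts at its allocated total
    status = []
    for dest in queue_destinations:
        if counters[dest] > 0:
            counters[dest] -= 1  # this request consumes one allocation
            status.append(1)
        else:
            status.append(0)     # allocation exhausted for this output port
    return status
```

For a queue with destinations [0, 1, 0, 0, 1] and allocated totals [2, 1], the status bits sent back to the queue positions are [1, 1, 1, 0, 0].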

Excluding the necessary cell or packet buffering in RAM queues, for which
there is no essential difference, the hardware-computing complexity product of
the method of the present invention is O(log2N) smaller than iSLIP for all hardware
items. Taking the worst values across all hardware items, again excluding cell
buffering, this actually translates into fewer computing steps than iSLIP by the
same factor O(log2N), but at the expense of greater hardware quantities by the
same factor O(log2N) for some of the hardware items. Nevertheless, the hardware
quantities required for the method of the present invention are much smaller than
the cell buffering hardware.

Figure 6 shows the number of computing steps for the iSLIP algorithm with
average convergence (solid line) and the method of the present invention (dashed
line), for a small switch with N = 32 input and output ports, as a function of the
number of time slots f under consideration. Figure 7 shows the ratio of computing
steps for the two algorithms, which reaches a minimum of 0.32 for f = 32 time
slots. Although this means that the method of the present invention is more than
three times as fast an algorithm as iSLIP, it does require cells to be buffered for
32 time slots to achieve this minimum ratio. However, the ratio is around 1/3 for
all numbers of time slots from 8 upwards, so any desired cell latency could be
chosen. The benefits of a frame-based algorithm become more significant for
switches with more ports N. Figures 8 and 9 show the equivalent graphs to
Figures 6 and 7 for the case where N = 256 ports. Here, the method of the
present invention takes only 0.195 times the iSLIP computing time at minimum,
requiring an optimum f = 192 time slots. Once again, the practical number of time
slots can be anything from 64 upwards, yet still provide around a 5-fold speed
advantage. Thus significant computing time reductions are achievable even if the
number of time slots f employed is made equal to the number of ports N.
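
The speed advantages quoted above follow from taking the reciprocal of the computing-step ratio. A minimal sketch of that arithmetic is given below; the function name speedup is hypothetical.

```python
def speedup(step_ratio):
    """Speed advantage over iSLIP implied by a ratio of computing steps."""
    return 1.0 / step_ratio


# Minimum ratios quoted in the text for N = 32 and N = 256 ports.
n32 = speedup(0.32)    # slightly more than 3
n256 = speedup(0.195)  # slightly more than 5
```

A ratio of 0.32 therefore corresponds to an algorithm a little over three times as fast, and a ratio of 0.195 to one a little over five times as fast.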

The overall hardware-computing complexity product of the method of the
present invention is of order N times smaller than a maximum weight matching
algorithm. In comparison, the iSLIP algorithm having average convergence is only
O(N/log2N) smaller.



CLAIMS

1. A method of allocating switch requests within a packet switch, the
method comprising the steps of:

(a) establishing switch request data at each input port;

(b) processing the switch request data for each input port to generate
request data for each input port-output port pairing;

(c) comparing the number of requests from each input port and to each
output port with the maximum request capacity of each input port and each
output port; and

(d) allocating all requests for those input-output pairs where the total
number of requests is less than or equal to the maximum request capacity of each
input port and each output port;

(e) reducing the number of requests for those input-output pairs where
the total number of requests is greater than the maximum request capacity of
each input port and each output port such that the number of requests is less
than or equal to the maximum request capacity of each input port and each
output port; and

(f) allocating the remaining requests.

2. A method of allocating switch requests within a packet switch, the
method comprising the steps of:

(a) establishing switch request data at each input port;

(b) processing the switch request data for each input port to generate
request data for each input port-output port pairing;

(c) allocating a first switch request from each of the input port-output
port pairing request data, the requests being allocated only if the maximum
request capacity of the respective output port has not been reached; and

(d) allocating further switch requests by the iterative application of step
(c) until the maximum request capacity of each output port has been reached.



3. A method of allocating switch requests within a packet switch, the
method comprising the steps of:

(a) establishing switch request data at each input port;

(b) processing the switch request data for each input port to generate
request data for each input port-output port pairing;

(c) identifying a first switch request from each of the input port-output
port pairing request data;

(d) identifying further switch requests by the iterative application of
step (c) until all of the switch request data has been identified;

(e) subject to the maximum request capacity of each input port and
each output port, allocating all of the identified switch requests; and

(f) reserving unallocated switch requests for use in the next phase of
switch request allocation.

4. A method of packet switching wherein the input port-output port routing
is allocated according to the method of any of claims 1 to 3 and the packets are
switched on the basis of the allocated routing.

5. A packet switch in which the input port-output port
routing is allocated in accordance with the method of any of claims 1 to 3.

6. A packet switch according to claim 5, wherein packets are switched from
an input port to a specified output port in accordance with the allocated routing.